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Abstract: A new type of redescending M-estimators is constructed, based on 
data augmentation with an unspecified outlier model. Necessary and suffi- 
cient conditions for the convergence of the resulting estimators to the Huber- 
type skipped mean are derived. By introducing a temperature parameter the 
concept of deterministic annealing can be applied, making the estimator in- 
sensitive to the starting point of the iteration. The properties of the annealing 
M-estimator as a function of the temperature are explored. Finally, two ap- 
plications are presented. The first one is the robust estimation of interaction 
vertices in experimental particle physics, including outlier detection. The 
second one is the estimation of the tail index of a distribution from a sample 
using robust regression diagnostics. 

Zusammenfassung: Bin neuer Typ von wiederabsteigenden M-Schatzern 
wird konstruiert, ausgehend von Datenerweiterung mit einem unspezifizierten 
AusreiBermodell. Notwendige und hinreichende Bedingungen fiir die Kon- 
vergenz zu Rubers "Skipped-mean"-Schatzer werden angegeben. Durch Ein- 
fiihrung einer Temperatur kann die Methode des "Deterministic Annealing" 
angewendet werden. Der Schatzer wird dadurch unempfindlich gegen die 
Wahl des Anfangspunkts der Iteration. Die Eigenschaften des Schatzers als 
Funktion der Temperatur werden untersucht. SchlieBlich werden zwei An- 
wendungen vorgestellt. Die erste ist die robuste Schatzung von Wechsel- 
wirkungspunkten in der experimentellen Teilchenphysik, einschlieBlich der 
Erkennung von AusreiBern. Die zweite ist die Schatzung des "Tail index" 
einer Verteilung aus einer Stichprobe mittels robuster Regressionsdiagnostik. 

Keywords: Redescending M-estimator, Deterministic annealing. Robust re- 
gression. Regression diagnostics. Tail index estimation. 



1 Introduction 



M-estimators were first introduced by 'Ruber (2004') as robust estimators of location and 
scale. Their study in terms of the influence function was undertaken by Rampel and 



co-workers ( [Rampel et al.[ [1986] ). Redescending M-estimators are a special class of M- 
estimators. They are widely used for robust regression and regression clustering, see 
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e.g. 



Muller|([2004[) and the references therein. According to the definition in|Hampel et al. 



( 1986 1, the -i/^-function of a redescending M-estimators has to disappear outside a certain 
central interval. Here, we merely demand that the -(/'-function tends to zero for |a;| — > oo. 
If ip tends to zero sufficiently fast, observations lying farther away than a certain bound 
are effectively discarded. Redescending M-estimators are thus particularly resistant to 
extreme outliers, but their computation is afflicted with the problem of local minima and 
a resulting dependence on the starting point of the iteration. 

The problem of convergence to a local minimum can be cured by combining the iter- 
ative computation of the M-estimate with a global optimization technique, namely deter- 
ministic annealing. For a review of deterministic annealing and its applications to clus- 
tering, classification, regression and related problems see Rose (1998]) and the references 
therein. To the best of our knowledge, the combination of M-estimators with determin- 
istic annealing has been proposed only by Li, ( , 1996, ). It will be shown below, however, 
that his annealing M-estimators have infinite asymptotic variance at low temperature, a 
feature that we deem to be undesirable in certain applications. 

The purpose of this note is to construct a new type of redescending M-estimators 



with annealing that converge to the Huber-type skipped mean (Hampel et al. 1986 1 if the 
temperature T approaches zero. The starting point is a mixture model of data and out- 
liers. Data augmentation is used to formulate an EM algorithm for the estimation of the 
unknown location of the data. The EM algorithm is then interpreted as a redescending 
M-estimator that can be combined with deterministic annealing in a natural way (Sub- 
sections 2.1 and 2.2| ). The most important case is a normal model for the data, but other 
models are possible. In Subsection 2.3 conditions are derived under which the corre- 
sponding M-estimator converges to the skipped mean. Section [3] explores the properties 
of the annealing M-estimator with a normal data model and illustrates the effect of de- 
terministic annealing on a simple example with synthetic data. Section |4] presents two 
applications of the annealing M-estimator: first, robust regression applied to the problem 
of estimating an interaction vertex in experimental particle physics; and second, regres- 
sion diagnostics applied to the problem of estimating the tail index of a distribution from 
a sample. 



2 Redescending M-estimators with 
Deterministic Annealing 

This section shows how a new type of redescending M-estimators can be constructed 
via data augmentation. The estimators are then generalized by introducing a temperature 
parameter so that deterministic annealing can be applied. The case of data models other 
than the normal one is discussed, and conditions for convergence to the skipped mean are 
derived. 
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2.1 Construction via Data Augmentation 

The starting point is a simple mixture model of data with outliers, with the p.d.f. 

h{x)=p- f{x;fi,a) + {l~p)- g{x). (1) 

By assumption, f{x; fi, a) is the p.d.f. of the normal distribution with location fi and scale 
a and can be written as 

f{x;fi,a) = (p{r), with r = (x - /i) /cr, 

Lp{.) being the standard normal density. The distribution of the outliers, characterized by 
the density g{x), is left unspecified. 

Now let (xi, . . . , Xn) be a sample of size n from the model in Eq. ([T]). The sample is 
augmented by a set of indicator variables Ij,j = 1, . . . ,n, where Ij = 0(1) indicates that 
Xj is an inlier (outlier). If the scale a is known, the location can be estimated by the EM 



algorithm (Dempster et al. 1977 1. In this particular case, the EM algorithm is an iterated 
re- weighted least- squares estimator, the weight of the observation Xj being equal to the 
posterior probability that it is an inlier. The latter is given by Bayes' theorem: 

P(j. = olx .) = P(x,|/, = 0)-P(/, = 0) 

^ ' ' P{xj\Ij = 0) ■ P{Ij = 0) + P{xj\Ij = 1) ■ P{I, = 1) ^ ^ 

As we do not wish to specify the outlier distribution, we resort to a worst case scenario 
and set P{Ij = 0) = P{Ij = 1) = 0.5. In addition we require that in the vicinity of yU, 
an observation should be an inlier rather than an outlier, so P{Ij = l|x) < P{Ij = 0\x) 
for |x — ;u|/cr < c, c > 0, where c is a cutoff parameter. This can be achieved by setting 
the prior probability P{xj\Ij = 1) that Xj is an outlier to P(/i + ca\Ij = 0) = (^{0). The 
posterior probability P(Ij = 0\xj) then reads: 

f{xj;n,a) + ip{c) (p{rj) + ip{c)' 

where Vj = {xj — fi)/cr. If rj = c, the posterior probabilities of Xj being an inlier or an 
outlier, respectively, are the same. 

If the inlier density is normal the EM algorithm is tantamount to an iterated weighted 
mean of the observations: 

n n 

i,^^^^) = Y^wfx,/Y,^f\ with 

i=i i=i 

(.) v-(^f) , 

Wj = , and 



The iterated weighted mean can also be interpreted as an M-estimator of location (Huber 



20041, with 



rip(r) f 

V'('^;c) = ^^ TT' p(r; c) = / ^(r; c) dr. 

(p{r) + Lp{c) J 

This interpretation allows us to analyze the estimator in terms of its influence function 
and associated concepts such as gross-error sensitivity and rejection point. 
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2.2 Introducing a temperature 

The shape of the score function p(r; c) can be modified by introducing a temperature 
parameter T into the weights. This allows to improve global convergence by using the 
technique of deterministic annealing ([Rose, 1998; Li, 1996| ). The modified weights are 
defined by 



w{r] c, T) 



Vp(r/yT) 



exp(-rV2T) 



<^(r/v^) + v{c/Vf) exp(-rV2T) + exp(-cV2T) ' 



(4) 



The redescending M-estimator with this weight function is called a normal-type or N-type 
M-estimator. Its ^/^-function is given by 



r exp(— r^/2T) 



exp(-r2/2T) + exp(-cV2T) ' 



and its p-function by 

p(r; c, T) = — - T In (exp(rV2T) + exp(cV2T)) + T In (l + exp(cV2T)) . (5) 

Figure [T] shows the weight function, the ^-function and the p-function of the N-type M- 
estimator for three different temperatures (T = 10,1,0.01). Note that Eq. ([5]) is not 
suitable for the numerical computation of p(r; c, T) if T is very small. A numerically 
stable version of Eq. ([5]) is given by: 



p{r;c,T) = < 



^r^ 1 + exp(-cV2T) 

2 ''l + exp((r2-c2)/2T) 



l + exp(-cV2T) 
2 ^l + exp((c2-r2)/2T) 



if |r| < c, 
if Irl > c. 



If the temperature increases, the weight drops more slowly as a function of r. In the 
limit of infinite temperature we have 

lim w{r; c, T) = -, 

T — ^oo 2 

for all c, and the M-estimator degenerates into a least-squares estimator. If the temperature 
drops to zero, the weight function converges to a step function. 

Proposition 1. Let H(-) be the unit step function (Heaviside function) with the additional 
convention H(0) = 1/2. Then 



lim w(r: c, T) = Hie — r). 

T — >o 



□ 



Proposition [T] follows from the more general Proposition [2] below. 
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2.3 Non-normal data models 



The density Lp used in defining the weight function in Eq. (|4]) need not be the standard 
normal density. In fact, every unimodal continuous symmetric density f{x) with location 
0, scale 1 and infinite range generates a type of redescending M-estimators. The behaviour 
of the weight function Wf{r] c, T) at low temperature is determined by the tail behaviour 
of /(x), as described by the concept of regular variation at infinity (Seneta 1916) . We 
recall that a function f{x) : [0, oo) — > (0, oo) is called regularly varying at infinity with 
index ^ e M if it satisfies 

/(Ax) 



lim 

X — >oo 



for any A > 0. 



If /(x) is a probability density function, ^ has to be in the interval (— oo, — 1) 
nition can be extended in the obvious sense to ^ = — oo. If ^ = — oo. 



The defi- 



lim 

X — ^oo 



/(Ax) 

/(^) 







oo 



for A > 1, 
for A < 1. 



In this case /(x) is also called rapidly varying at infinity (Seneta 1976 1. Normal densities 



are rapidly varying at infinity, as are all densities with exponential tails. 
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Proposition 2. 

(a) Let H(-) be as in Proposition [TJ /(r) is rapidly varying at infinity if and only if 

lim Wf{r; c, T) = H{c — r). 

for all c> 0. 

(b) /(r) is regularly varying at infinity with index ^ G M if and only if 



lim Wf{r] c, T) 



T — s>o ' ' _|_ 7--? -|- c"? ' 

for all c> 0. □ 
The proof is omitted, but can be obtained from the authors on request. 
Example 1 (The Hyperbohc Secant Distribution). 

The hyperbolic secant distribution is a symmetric distribution with exponential tails. The 
standardized density is equal to 

h{r) 



2cosh(r7r/2)' 

The -^-function of the corresponding (HS-type) redescending M-estimator is shown in 
Fig. [2];a), for three different temperatures (T = 10, 1, 0.01). It is easy to show that h{r) 
is rapidly varying at infinity. According to Proposition [2| the weight function Wh{r; c, T) 
converges to H(c — r) for T — ¥ 0, and the corresponding M-estimator approaches the 
skipped mean. □ 

Example 2 (Student's t-Distribution). 

Student's t-distribution is a symmetric distribution with tails falling off according to a 
power law. The standardized density with u > 2 degrees of freedom is equal to 

The ^-function of the corresponding (t,^-type) redescending M-estimator with z/ = 3 is 
shown in Fig.[2]^b), for three different temperatures (T = 10, 1, 0.01). The density ti/(r) 
is regularly varying at infinity with index ^ = — (z/ + 1). From Proposition 2 follows: 



lim Wt,{r;c,T) 



For u — > 00, this function approaches the step function H(c — r). □ 



3 N-type M-estimators of location 

Li this section the properties of the annealing M-estimator with a normal data model are 
explored. The effect of deterministic annealing on the objective function of the estimator 
is illustrated on a simple example with two clusters (data and outliers). 
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Figure 2: ip{r; c,T) of redescending M-estimators of (a) hyperbolic secant-type and (b) 
Student's h-type , for c = 2.5 and T = 10, 1, 0.01. 



3.1 Basic properties 

The influence function is always proportional to ip: 

IF(r; ij{r- c, T),F) = ^(r; c, T)/K{c, T). 



If the model distribution is the standard normal distribution, K{c, T) is given by (Ham- 



pel et al. 1986): 



K(c,T) = / r ip{r; c,T) (^(r) dr = 2 / r ip(r; c,T) Lp{r) dr. 
Jr Jo 

Unfortunately, the integral cannot be written in closed form. Figure [3|a) shows K{c, T) 
as a function of T, for c = 1.5 : 0.5 : 3, computed by numerical integration. Complications 
at very small values of T can be avoided by splitting the interval of integration [0, oo) at 
c. The low- and high-temperature limits can be computed explicitly: 

lim ir(c,T) = 2 $(c)-l-2cv.(c), lim K{c,T) = \. 

The point of maximum influence can be computed by means of the Lambert W^-function 
dCorless et aL|[T996l ): 



r„ 



[c,T) = v/2Tw(c,T) +T, with u;(c, T) = exp(cV2r - 1/2)). 



Tmax is showu in Figure |3];b). The maximum value of the influence function is the gross- 
error sensitivity 7* : 

*f ^\ iT?f ^\ ij{rm^^{c,T);c,T) 1 2VTuj{c,T) 

7 c, r = max IF r: c, T = — — — = — — — — . 

^ ^ KicT) KicT) ^2uj{c,T) + l 

Figure [3]^c) shows the gross-error sensitivity as a function of T, for c = 1.5 : 0.5 : 3. The 
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Figure 3: N-type M-estimator: (a) K{c,T), (b) r^^x, (c) gross error sensitivity, (d) ef- 
fective rejection point for e = 10~^ and (e) asymptotic variance, as a function of the 
temperature T, for c = 1.5 : 0.5 : 3. 



minimum value lies in the range 1 < T < 2, so if one aims to minimize 7*, the final 
temperature should be chosen in that range. In the low-temperature limit we have 

At T = 0, 7* is minimal for c ~ 2.14. The weight function w(r; c, T) is always positive, 
so the M-estimator does not have a finite rejection point. However, an effective rejection 
point can be computed for a threshold e: 

p:ff(c, T,£) = sup {r:IF(r;c,T)>£}. 

Figure [3|d) shows the effective rejection point for e = 10^^. In the limit T — y the 
effective rejection point approaches the cutoff value c. Finally, the asymptotic variance at 
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the standard normal distribution, given by 

T.. rj.. _ LHr;c,T)ip{r) dr 
' K{c,Ty 

is shown in Figure |3];e). The low- and high-temperature limits are given by: 

lim V(c,T) = —— ^ — , lim V(c,T) = 1. 

T^o ^' ^ 2<l>(c) - 1 - 2c(^(c) r^oo ^' ^ 

Figure|3]shows that, for a given cutoff value c, it is not possible to minimize the gross-error 
sensitivity and the rejection point at the same time. The choice of the stopping temperature 
therefore depends on the problem at hand. If the asymptotic efficiency is important the 
cutoff value c should be between 2.5 and 3, at the cost of a somewhat higher gross-error 
sensitivity and a larger rejection point. Cutoff values larger than 3 are not recommended. 



3.2 Effect of Deterministic Annealing 

The effect of deterministic annealing on the minimization of the objective function of the 
N-type estimator is illustrated on a simple problem of location estimation with synthetic 
data. The data are generated from the following mixture model with mean-shift outliers 
(see Eq. 

h{x) = p ■ ip{x) + (1 — p) ■ (p{{x — m)/a). 

We have chosen the following mixture parameters: 

p = 0.7, m = 6, cr = 1, 

which results in two barely separated standard normal components. An example data 
set with 500 observations is shown in Figure |4j There are 364 inliers and 136 outliers. 
The scale estimate s is computed by taking the median of the absolute deviations from the 
half-sample mode, which in this situation is a better measure of the inlier location than the 



sample median (Bickel and Friihwirth 2006 1. Its normal-consistent value for the example 



data is 1.31, whereas the normal-consistent MAD is equal to 1.56. The cutoff has been 
set to c = 2.5. 

It is instructive to observe the evolution of the objective function 



M(/x; c, T) = ^ p{{xi - /i)/s; c, T) 



i=l 



with falling temperature T (see Figure |5]). At large T, the weights are nearly independent 
of the residuals, and the objective function is almost quadratic. If the temperature is de- 
creased, the objective function starts to reflect the structure of the data, eventually showing 
two clear local minima. These minima could be used to detect clusters in the data ( [Garlipp 



and Miiller 2005 1. As the objective function is minimized at each temperature, the final 



estimate is now totally independent of the starting value. As long as the high-temperature 
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Figure 4: Example data set with 500 observations from a mixture of two standard normal 
distributions. The difference of the means is equal to six. There are 364 inliers and 136 
outliers. 



minimum is closer to the deeper low-temperature minimum convergence to the latter is 
virtually guaranteed. 

If the separation m between inliers and outliers is decreased, the final objective func- 
tion eventually has a single minimum. Figure [6] shows the final objective function at 
T = 0.1 for m = 6, 5, 4, 3. At m = 5 the second local minimum has disappeared, but 
the objective function still has a point of inflection close to the outlier location, and the 
estimate is unbiased. At m = 4 the point of inflection is barely visible, and the estimate 
shows a small bias. At m = 3 the point of inflection has disappeared, and the estimate 
shows a clear bias. In contrast, the median and the half-sample mode are totally unaffected 
by the change in separation. 

Deterministic annealing in combination with redescending M-estimators has already 
been proposed byO('l996). One of the weights function used there is a modified Welsch 
estimator with the weight function 

w{r;T) = exp(— r^/2T), 

which is equal to the numerator of the N-type weight function in Eq. (|4]). It is easy to 
show that the asymptotic variance of this estimator at the standard normal distribution is 
equal to 

V(T) = ^ ^ 

^ ' (2 + T)3/2T3/2' 

and consequently 

lim V(T) = oo. 

T — >0 



o 
c 

cr 



_3 

< 



20 



15 
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The same holds for the other two weight functions proposed by |Lil ( [T996] ). 
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5 10 5 10 

Figure 5: Evolution of the objective function M(/i; c, T) with falling temperature. The 
open circle (O) is the starting point of the iteration at the respective temperature, the 
x-mark (x) is the final estimate at the respective temperature. 




Figure 6: The objective function M(/x; c, T) at the final temperature T = 0.1, for different 
values of the mean shift m between inliers and outliers. The open circle (O) is the starting 
point of the iteration, the x-mark (x) is the final estimate. 
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Figure 7: A primary vertex with several outgoing tracks. 



4 Applications 

In this section we present two application of the annealing M-estimator. In the first one 
the estimator is applied to the problem of estimating robustly the interaction vertex of 
a particle collision or a particle decay. The results show that annealing is instrumental 
in identifying and suppressing the outliers. In the second application the annealing M- 
estimator is used for regression diagnostics in the context of the estimation of the tail 
index of a distribution from a sample. 

4.1 Robust regression and outlier detection 

The N-type estimator can be applied to robust regression with minimal modifications. The 
procedure is illustrated with the following problem from experimental particle physics. 

An interaction vertex or briefly vertex is the point where particles are created by a 
collision of two other particles, or where an unstable particle decays and produces two or 
more daughter particles. The position of the vertex has to be estimated from the parame- 
ters of the outgoing particles, the so-called track parameters. The track parameters consist 
of location, direction and curvature. They have to be estimated before the vertex can be 
estimated. As an illustration. Figure |7] shows a primary vertex, the interaction point of 
two beam particles in the accelerator, and several outgoing tracks. The precision of the 
estimated track parameters is indicated by the width of the tracks. 

The least-squares (LS-)estimator of the vertex position v minimizes the sum of the 
squared standardized distances of all tracks from the vertex position v: 




The distance di is approximated by an affine function of v, obtained by a first-order Taylor 
expansion of the track model, which is the solution of the equation of motion of the 
particle: 

di{v) ^ Ci + a.-i^v. 
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The cTj^ are known from the estimation procedure of the track parameters. 
With the redescending N-type M-estimator each track gets a weight Wi'. 

exp(-riV2T) 



exp(-ri72r) + exp(-cV2r) ' 

As a consequence, outlying or mis-measured tracks are downweighted by a factor Wi. As 
the factor Wi depends on the current vertex position v, the M-estimator is computed as an 
iterated reweighted least-squares estimator. The dependence on the starting point is cured 
by annealing. The final weights can be used for a posterior classification of the tracks as 
inliers (wi > 0.5) or outliers (Wi < 0.5). 

In our example we have used simulated events from the CMS experiment (ICMS Col- 



laboration. |1994[ 2007 1 at CERN, real data not yet being available. We have studied the 



estimation of the primary (beam-beam collision) vertex. For more details about the es- 



timation problem, see Waltenberger et al. (2007 1. The primary particles produced in the 



beam-beam collision are the inliers, whereas short-lived secondary particles produced in 
decays of unstable particles are the outliers, along with mis-measured primary tracks. Pri- 
mary and secondary tracks can be identified from the simulation truth. Estimation of the 
primary vertex was done by the N-type M-estimator, with the least-squares estimator as 
the starting value. The annealing schedule was Tq = 256, Tj+i = Tend + liTi — Tend), 
with q = 0.25. 

The results are summarized in Table [T| The first column shows the type of annealing 
used, the second and third columns show the classification of the primary tracks by their 
final weights, the fourth and fifth columns show the classification of the secondary tracks, 
and the last column shows how many vertex estimates were within 100 fxm of the true 
vertex position, known from the simulation. Without annealing, the N-type M-estimator 
performs better at T = 1 than at T = 0.01. However, the results show that annealing 
is essential for the correct classification of primary and secondary tracks. The natural 
stopping temperature of the annealing procedure is Tend = 1, but cooling to Tend = 0.01 
gives a slight improvement. 

A similar method can be employed for the estimation of the track parameters. In this 
case, several observations may compete for inclusion into the track, and the computation 



of the weights has to be modified accordingly (Friihwirth and Strandlie, 1999) 



Table 1: Results of vertex estimation with the N-type M-estimator, using simulated data. 
For details see the text. 





inliers 


outliers 


vertices 


Annealing schema 


w<0.5 


w>0.5 


w<0.5 


w>0.5 




No anneahng, T = 


1 


0.312 


0.688 


0.859 


0.141 


1422 


No annealing, T = 


0.01 


0.512 


0.488 


0.899 


0.101 


1004 


Annealing, Tend = 


1 


0.101 


0.899 


0.828 


0.172 


1913 


Annealing, Tend = 


0.01 


0.092 


0.908 


0.829 


0.171 


1939 
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(a) Student's t, v=2 (b) Student' s t, v=4 




Figure 8: Pareto quantile plots of two samples of size n = 1000 from the t-distribution, 
with (a) z/ = 2 and (b) z/ = 4, respectively. 



4.2 Tail index estimation 

The tail index a of the distribution of a random variable X is defined by 

a = sup{(5 > : E(|X|^) < oo}. 
The tail index determines how many moments of X exist. A consistent estimator of 



from a sample (xi, . . . , x„) is the Hill estimator (Hill 1975 1 



a. 



1 



(n-fe)- 



As pointed out by |Beirlant et al.| ( [1996[ ), the choice of is a problem of regression diag- 
nostics. This can be understood by looking at the Pareto quantile plot of the sample. The 
latter is a scatter plot {xj,yj),j = 1, . . . , n, with 



log 



n + 1 



yj = \ogX^n~j+i), j = 1, 



n. 



As an example, Figure [8] shows the Pareto quantile plot of two samples from the t- 
distribution, with u = 2 and z/ = 4, respectively. The sample size is n = 1000. 

If a fine is fitted to the linear part of the plot, its slope is an estimate of 1/a. The 
problem is therefore to find the linear part of the plot. It is worth noting that standard 
robust regression methods such as LMS or LTS ( [Rousseeuw and Leroy , |1987 ) will fail, 
as by definition the tail is not the majority of the data. 

The N-type M-estimator can be used for regression diagnostics in order to find the 
linear part of the Pareto quantile plot. The algorithm is based on the idea of the forward 
search ( [Atkinson and Rianl] |2000[ ) and is composed of the following steps: 
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Figure 9: (a) Optimal proportion popt of the sample and (b) optimal RMSE of the Hill 
estimator, as a function of v. 



Algorithm A 

Al. Compute the scale of yi, using the asymptotic expression for quantiles and a kernel 
estimator for the probability density function. As the kernel estimator is unreli- 
able in the very extreme part of the tail, the largest half percent of the sample is 
discarded. 

A2. Fit a robust regression line with the N-type M-estimator to the m largest points in 
the Pareto quantile plot. The starting line is the LMS regression line. The tempera- 
ture is set to T = 1 . 

A3. Freeze all weights and extend the fit successively to the lower portion of the plot, 
by adding m points at a time. The choice of m is a trade-off between speed and 
safety. 

A4. Stop adding points when the new weights get too small, indicating failure of the 
linear model. 



We have tested the algorithm on samples from the t-distribution with v degrees of 
freedom, with n = 1000 and u = 1: 0.5 : 10. The baseline is the Hill estimator using the 
optimal value of k. The latter was found for each value of v by computing Hill estimators 
with different values of k and choosing the one that minimizes the root mean-square error 
(RMSE) of with respect to the true value a^^ = Xjv. Figure |9] shows the optimal 
proportion p = k/n and the RMSE of the corresponding Hill estimators as a function 
of V. 

Algorithm A as described above was run with m = 10, i.e. one percent of the sample 
size. The cutoff parameter c was adjusted at the 99%-quantile of the Xi distribution, i.e. at 
c = 2.576. The fit was stopped as soon as at least half of the new weights were smaller 



than 99% of the maximum weight Wmax = + exp(— c /2)] = 0.965. Figure 10 



summarizes the results. The left hand panel (a) shows box plots of the proportion of the 
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Figure 10: (a) Proportion p of the sample used by Algorithm A and (b) resulting RMSE 
of Algorithm A, as a function of v. 



sample included in the regression, one for each value of z/. The right hand panel (b) shows 
the resulting RMSE of with respect to the true value = ^lv, for all v. The figure 
clearly shows that there is a tendency to include more data points than required for the 
optimal estimate. As a consequence, the RMSE is somewhat larger than in the optimal 
case. On the other hand, in a real life situation no external information at all may be 
available about the optimal value of A;. In this case the regression diagnostics approach, 
which is entirely driven by the data, is a viable alternative. 

5 Summary 

A new type of redescending M-estimators has been introduced, suitable for combination 
with deterministic annealing. It has been shown that the annealing M-estimator con- 
verges to the skipped mean if and only if the inlier density is rapidly varying at infinity. 
Deterministic annealing helps to make the estimator insensitive to the starting point of 
iteration. Possible applications are location estimation, robust regression and regression 
diagnostics. The new type of estimators is particularly useful if the scale of the observa- 
tions is known. In other cases the scale has to be estimated from the data, preferably in a 
robust way. 
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